In recent years, much attention has been paid to the construction of 
model selection criteria via penalization. Vladimir Koltchinskii has to be 
congratulated for providing a theory reaching a level of generality that is 
sufficiently high to recover most of the recent results obtained on this topic 
in the context of statistical learning. Thanks to concentration inequalities 
and empirical process theory, we have now reached a point where the order 
of the excess risk of the empirical minimizer on a given model is well 
understood. Koltchinskii's paper provides several ways of expressing that 
this excess risk can be sharply bounded by quantities depending on the 
complexity of the model in various senses. The most prominent relies on 
Rademacher processes, whose use in statistics Vladimir Koltchinskii himself 
pioneered. We even know that these upper bounds on the excess risk are 
often unimprovable (see, e.g., the lower bounds in [6]). 

The same machinery used to analyze the excess risk can be applied to 
produce penalized criteria and to establish oracle-type risk bounds for the so- 
defined penalized empirical risk minimizer. The problem of defining properly 
penalized criteria is particularly challenging in the classification context, 
since it is connected to the question of defining optimal classifiers without 
knowing in advance the "noise condition" of the underlying distribution 
[(8.2) of the discussed paper]. This condition determines the attainable rates 
of convergence and is a topic attracting much attention in the statistical 
learning community at this moment (see the numerous references in the 
discussed paper). 

What we would like to discuss is the gap between theory and practice 
of model selection. Of course, the existence of a gap between the methods 
which are analyzed in theory, and those which are used in practice, is in 
some sense unavoidable. Our purpose here is to express our perception of 
the current situation regarding this gap, and to propose some ideas which 
could contribute to reducing it. 

As a starting point for our discussion, we would like to briefly analyze the 
behavior of the so-called hold-out selection procedure. This procedure should 
be seen as some primitive version of the V-fold cross-validation method, 
which is probably the most commonly used model selection method in prac- 
tice, in the context of statistical learning. One advantage of hold-out is that 
it is very easy to study from a mathematical point of view. The point we 
want to make here is simple, but instructive in this context: hold-out is 
actually a selection method that is adaptive to the classification noise 
condition. This property of the hold-out procedure does not seem to be widely 
known. Since the proof is short and disarmingly simple, we reproduce it here 
(as inspired from [5], Chapter 8, where more general results for hold-out are 
also proved). 

1. Hold-out adapts to the noise condition. Our analysis is based on the 
following selection theorem among a finite collection of functions, which can 
be seen as a very elementary version of Theorem 6 of the discussed paper. 
In what follows, we stick to Vladimir Koltchinskii's notation and 
conventions; in particular, we use (5.3) to express the (unknown) noise 
condition [see also the related equation (8.2)]. 



1.1. A basic selection result. 



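Theorem 1. Let $\{f_m, m \in \mathcal{M}\}$ be a finite collection of 
real-valued measurable functions defined on some measurable space 
$\mathcal{X}$ and with $|\mathcal{M}| \ge 2$. Let $\xi_1, \ldots, \xi_n$ be 
some i.i.d. random variables with common distribution $P$ and denote by 
$P_n$ the empirical probability measure based on $\xi_1, \ldots, \xi_n$. 
Assume that $|f_m - f_{m'}| \le 1$ for every $m, m' \in \mathcal{M}$. Assume 
furthermore that $P f_m \ge 0$ for every $m \in \mathcal{M}$. 

Let $\varphi$ be a convex function on $[0, +\infty)$ with $\varphi(0) = 0$ 
and such that $\varphi(x)/x^2$ is nondecreasing; denote by $\varphi^*$ the 
convex conjugate of $\varphi$. Assume 

(1) $P f_m \ge \varphi\bigl(\sqrt{P f_m^2}\bigr)$ for every $m \in \mathcal{M}$. 

Consider some random variable $\hat m$ such that 

$P_n f_{\hat m} = \inf_{m \in \mathcal{M}} P_n f_m.$ 

Then, for every $\varepsilon \in (0,1)$, the following exponential bound holds 
for every positive real number $x$: 

(2) $\mathbb{P}\Bigl[ P f_{\hat m} > C_\varepsilon \inf_{m \in \mathcal{M}} P f_m + C'_\varepsilon (x + \ln|\mathcal{M}|) \Bigl( \frac{4}{\varepsilon} \varphi^*\Bigl(\frac{1}{\sqrt{n}}\Bigr) + \frac{1}{3n} \Bigr) \Bigr] \le e^{-x},$ 

where $C_\varepsilon = \frac{1+\varepsilon}{1-\varepsilon}$ and $C'_\varepsilon = (1-\varepsilon)^{-1}$. 

In particular, the following control in expectation is valid: 

(3) $\mathbb{E}[P f_{\hat m}] \le C_\varepsilon \inf_{m \in \mathcal{M}} P f_m + C'_\varepsilon \ln(e|\mathcal{M}|) \Bigl( \frac{4}{\varepsilon} \varphi^*\Bigl(\frac{1}{\sqrt{n}}\Bigr) + \frac{1}{3n} \Bigr).$ 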

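Proof. Let $\bar m$ be such that 

$P f_{\bar m} = \inf_{m' \in \mathcal{M}} P f_{m'}.$ 

Notice that by definition of $\hat m$, $P_n f_{\hat m} \le P_n f_{\bar m}$. Hence, 

(4) $P f_{\hat m} = (P - P_n) f_{\hat m} + P_n f_{\hat m} \le P_n f_{\bar m} + (P - P_n) f_{\hat m} = P f_{\bar m} + (P - P_n)(f_{\hat m} - f_{\bar m}).$ 

Setting, for every $m' \in \mathcal{M}$, $\sigma_{m'}^2 = P f_{m'}^2$, it follows 
from Bernstein's inequality that for every $m' \in \mathcal{M}$ and every 
positive number $y$, the following holds except on a set of probability less 
than $e^{-y}$: 

$(P - P_n)(f_{m'} - f_{\bar m}) \le \sqrt{\frac{2y}{n}}\,(\sigma_{\bar m} + \sigma_{m'}) + \frac{y}{3n}.$ 

By the union bound, choosing $y = \ln|\mathcal{M}| + x$, and using (1), this 
implies that, except on some set $\Omega_x$ with probability less than $e^{-x}$, 

(5) $(P - P_n)(f_{\hat m} - f_{\bar m}) \le \sqrt{\frac{2(x + \ln|\mathcal{M}|)}{n}}\,\bigl( \varphi^{-1}(P f_{\bar m}) + \varphi^{-1}(P f_{\hat m}) \bigr) + \frac{x + \ln|\mathcal{M}|}{3n}.$ 

Let $\varphi^*$ be the convex conjugate of $\varphi$; we then have 

$\sqrt{\frac{2y}{n}}\,\varphi^{-1}(P f_{\hat m}) \le \varphi\bigl(\sqrt{\varepsilon}\,\varphi^{-1}(P f_{\hat m})\bigr) + \varphi^*\Bigl(\sqrt{\frac{2y}{\varepsilon n}}\Bigr) \le \varepsilon P f_{\hat m} + \frac{2y}{\varepsilon}\,\varphi^*\Bigl(\frac{1}{\sqrt{n}}\Bigr),$ 

with a similar inequality for $\bar m$. For the last inequality above, we have 
used the assumption that $\varphi(x)/x^2$ is nondecreasing, which readily 
implies that $\varphi^*(x)/x^2$ is nonincreasing, along with the fact that 
$\varepsilon < 1$ and $2y/\varepsilon \ge 1$. Combining this inequality with (5) 
and (4) yields 

$(1 - \varepsilon) P f_{\hat m} \le (1 + \varepsilon) P f_{\bar m} + (x + \ln|\mathcal{M}|) \Bigl( 4\varepsilon^{-1} \varphi^*\Bigl(\frac{1}{\sqrt{n}}\Bigr) + \frac{1}{3n} \Bigr).$ □ 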
1.2. An oracle inequality for hold-out. Let us now describe and study the 
hold-out procedure. Assume that we observe $N + n$ random variables with 
common distribution $P$ depending on some parameter $g_*$ to be estimated. 
The first $N$ observations $\xi' = (\xi'_1, \ldots, \xi'_N)$ are used to build 
some preliminary collection of estimators $\{\hat g_m\}_{m \in \mathcal{M}}$, 
and we use the remaining observations $\xi_1, \ldots, \xi_n$ to select some 
estimator among the collection $\{\hat g_m\}_{m \in \mathcal{M}}$. More 
precisely, we consider here the situation described in Section 7 of the paper, 
where there is some (bounded) loss or contrast 

$\ell : T \times \mathbb{R} \to [0,1]$ 

which is well adapted to our estimation problem of $g_*$ in the sense that 
the expected loss $\mathbb{E}[\ell \bullet g] = \mathbb{E}[\ell(Y, g(X))]$ 
achieves a minimum at $g_*$ when $g$ varies in $\mathcal{G}$. We denote the 
relative expected loss as follows: 

$L(g, g_*) = \mathbb{E}[\ell \bullet g - \ell \bullet g_*]$ for all $g \in \mathcal{G}$. 

For bounded regression or binary classification, we can take, for example, 

$\ell(y, x) = (y - x)^2;$ 

then $g_*(x) = P(Y = 1 \mid X = x)$ (resp. $g_* = b^*$, the Bayes classifier) 
is indeed the minimizer of $\mathbb{E}[(Y - g(X))^2]$ over the set of 
measurable functions $g$ taking their values in $[0, 1]$ (resp. $\{0, 1\}$). 
We can now apply Theorem 1, conditionally on the training sample 
$\xi'_1, \ldots, \xi'_N$, to the collection of functions 

$\{f_m = \ell \bullet \hat g_m - \ell \bullet g_*,\ m \in \mathcal{M}\}.$ 

Define, as in the theorem, $\hat m$ as a minimizer of the empirical risk 
$P_n(\ell \bullet \hat g_m)$ over $\mathcal{M}$. If $\varphi$ satisfies the weak 
regularity assumptions of Theorem 1 and is such that 

(6) $\sup_{g :\, L(g, g_*) \le \varepsilon} \|\ell \bullet g - \ell \bullet g_*\|_{2,P} \le \varphi^{-1}(\varepsilon),$ 

we derive from (3) that conditionally on $\xi'_1, \ldots, \xi'_N$ one has for 
every $\varepsilon \in (0,1)$: 

(7) $\mathbb{E}[L(\hat g_{\hat m}, g_*) \mid \xi'] \le C_\varepsilon \inf_{m \in \mathcal{M}} L(\hat g_m, g_*) + C'_\varepsilon \ln(e|\mathcal{M}|) \Bigl( \frac{4}{\varepsilon} \varphi^*\Bigl(\frac{1}{\sqrt{n}}\Bigr) + \frac{1}{3n} \Bigr).$ 

The striking feature of this result is that the hold-out selection procedure 
provides an oracle-type inequality involving the modulus of continuity 
$\varphi^{-1}$, which is not known in advance. This is especially interesting 
in the classification framework, for which $\varphi$ can be of very different 
natures according to the difficulty of the classification problem. The main 
issue is therefore to understand whether the term 
$\varphi^*(n^{-1/2})(1 + \ln|\mathcal{M}|)$ appearing in (7) is indeed a 
remainder term or not. We cannot exactly answer this question in general 
because it is hard to compare $\delta_n := \varphi^*(n^{-1/2})$ with 
$\inf_{m \in \mathcal{M}} L(\hat g_m, g_*)$. However, if $\hat g_m$ is itself an 
empirical risk minimizer over some model $\mathcal{G}_m$, we can compare 
$\delta_n$ with $\inf_{m \in \mathcal{M}} \theta_{m,N}$, where $\theta_{m,N}$ is 
an upper bound (up to constant) for the expectation of the excess risk within 
model $\mathcal{G}_m$. More precisely, taking for instance $\varepsilon = 1/2$ 
(so that $C_\varepsilon = 3$ and $C'_\varepsilon = 2$), we derive from (7) that 
for some constant $\kappa$ 

(8) $\mathbb{E}[L(\hat g_{\hat m}, g_*)] \le 3 \inf_{m \in \mathcal{M}} \bigl( L(\mathcal{G}_m, g_*) + \kappa\,\theta_{m,N} \bigr) + \ln(e|\mathcal{M}|) \bigl( 16\,\delta_n + 2(3n)^{-1} \bigr),$ 

where $L(\mathcal{G}_m, g_*) = \inf_{g \in \mathcal{G}_m} L(g, g_*)$. Now, using 
the notation and method described in Section 3 of the discussed paper, we 
obtain as a value for $\theta_{m,N}$ the (largest) fixed point of the function 

where $\phi_{m,N}$ is a nondecreasing function which more or less plays the 
role of a modulus of continuity of the empirical process 
$(P_N - P)(\ell \bullet g)$ over the model $\mathcal{G}_m$. If $N$ and $n$ are 
of the same order of magnitude, say $N = n$ to be as simple as possible, then 
$U_N^{(m)}(\delta) \ge K D(\delta) n^{-1/2} \ge K' \varphi^{-1}(\delta) n^{-1/2}$, 
where we assumed that (6) is sharp up to a constant. Therefore, 
$\theta_{m,n}$ is surely larger (again up to a constant factor) than the 
solution $\bar\delta_n$ of $\delta = \varphi^{-1}(\delta)\, n^{-1/2}$. Since 
$\varphi$ is a nonnegative convex function with $\varphi(0) = 0$, elementary 
considerations then show that $\bar\delta_n \ge \delta_n$. 

In fact, $\theta_{m,n}$ will typically turn out to be much larger in magnitude 
(as a function of $n$) than $\delta_n$, since the fixed point equation for 
$\theta_{m,n}$ also involves the function $\phi_{m,n}$, which measures in some 
way the complexity of model $m$. Finally, the factor $\ln|\mathcal{M}|$ 
appearing above should also be considered minor if we assume, for instance, 
that $|\mathcal{M}| \le n^k$ for some $k > 0$. In conclusion, except in very 
pathological situations, the quantity $\delta_n \ln|\mathcal{M}|$ really plays 
the role of a remainder term in (8). 
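
The selection step analyzed in this section can be sketched in a few lines. 
The example below is only an illustration under assumptions that are not in 
the text: binary classification with labels in {0, 1} (for which the square 
loss coincides with the 0-1 loss), one-dimensional inputs in [0, 1), and 
candidate estimators that are histogram (majority-vote) classifiers indexed 
by their number of cells. All function names are hypothetical. 

```python
def fit_histogram_classifier(train, n_bins):
    """Majority-vote classifier on a regular partition of [0, 1) into n_bins cells.

    `train` is a list of (x, y) pairs with x in [0, 1) and y in {0, 1}.
    Empty cells and ties are labeled 0 (an arbitrary convention).
    """
    votes = [[0, 0] for _ in range(n_bins)]  # votes[cell][label]
    for x, y in train:
        cell = min(int(x * n_bins), n_bins - 1)
        votes[cell][y] += 1
    labels = [1 if ones > zeros else 0 for zeros, ones in votes]

    def classify(x):
        return labels[min(int(x * n_bins), n_bins - 1)]

    return classify


def empirical_risk(classifier, sample):
    """Average 0-1 loss of `classifier` on `sample`."""
    return sum(classifier(x) != y for x, y in sample) / len(sample)


def hold_out_select(data, n_train, bin_counts):
    """Fit one candidate per model on the first n_train points, then pick the
    candidate with the smallest empirical risk on the remaining points."""
    train, validation = data[:n_train], data[n_train:]
    candidates = {m: fit_histogram_classifier(train, m) for m in bin_counts}
    risks = {m: empirical_risk(g, validation) for m, g in candidates.items()}
    m_hat = min(risks, key=risks.get)  # first minimizer in case of ties
    return m_hat, candidates[m_hat], risks
```

Note that the selection step never uses $\varphi$ or any other feature of the 
noise condition: only the validation risks enter, which is the sense in which 
the procedure adapts. 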

2. Data-driven penalties. 

2.1. A sober assessment of the current state of the art. It is, in some 
sense, somewhat disappointing to discover that a very crude method like 
hold-out is working so well. This is especially true in the classification frame- 
work, where it is indeed painstakingly difficult to design penalties that are 
adaptive to the noise condition. Recent works on the topic, involving local 
Rademacher penalties for instance, provide at least some theoretical solu- 
tions to the problem; but they systematically involve unknown constants — 
either because the numerical values coming from the theory are overpes- 
simistic, or, worse, because these constants also depend on nuisance param- 
eters related to the unknown distribution (e.g., the infimum of the density 
of explanatory variables). 

We therefore end up in the following delicate situation: 

• From a theoretical point of view, we are not in a position to justify that 
conveniently penalized model selection methods (or more generally, model 
selection methods that use the entire sample for the estimation within each 
model) could improve over the simple hold-out solution. 

• From a practical point of view, the penalization method does not provide 
a "ready-to-use" solution and remains far from being competitive with 
relatively simple methods that are widely used in practice. We have in 
mind in particular V-fold cross-validation. 

At this stage, two natural and connected questions emerge: 

• Is there some room left for penalization methods? 

• How should penalties be calibrated to design efficient procedures? 

There is at least one strong reason why, despite the arguments developed 
above, one should remain interested in penalization methods: for 
independent but not identically distributed observations (we typically think 
of Gaussian regression on a fixed design), hold-out (for theory) or 
cross-validation (for practice) may break down or become irrelevant. 

Another issue is that, intuitively, one would expect that the hold-out leads 
to the loss of a factor 2 because of the sample halving process. Unfortunately, 
much looser constants appear when applying the more elaborate theoretical 
tools needed to tackle penalization, where the entire sample is used for the 
estimation within each model. These larger constants drown out the initial 
factor 2 advantage over the hold-out, so that the theory may currently not 
be precise enough to distinguish this effect. 

In other words, since the opponents are strong, beating them remains 
possible but requires one to calibrate penalties sharply. This leads us to 
the second question raised above. We would like now to provide some ideas 
that can contribute to answering this last question, partly based on theoret- 
ical results which are already available and partly based on heuristics and 
thoughts which lead to some empirical rules and new theoretical problems. 

2.2. A rule of thumb for calibrating penalties from the data. A general 
idea consists in guessing what is the right penalty to be used from the data 
itself. Let us roughly describe the type of results which have been proved in the 
Gaussian framework in [2]. In several contexts (such as variable selection, 
e.g.), it is possible to prove lower bounds for penalties (meaning that lower 
penalties will lead to asymptotic inconsistency). Moreover, a close inspec- 
tion of oracle inequalities shows that approximately optimal values for the 
penalty are linked to minimal values within a factor 2. We can therefore 
retain from this Gaussian theory the rule of thumb: 

(9) "optimal" penalty = 2 × "minimal" penalty. 

Interestingly, the minimal penalty can be evaluated from the data: when the 
penalty is not heavy enough, one systematically chooses models with very 
large dimension. It remains to double this minimal penalty to produce the 
desired (nearly) optimal one. This strategy allows one to design a data-driven 
penalty without knowing in advance the level of noise in Gaussian 
regression. In the context of change-point detection, this data-driven 
calibration method for the penalty has been successfully implemented and 
tested by Lebarbier (see [3]). 

In the non-Gaussian case, we believe that this procedure remains valid, 
but theoretical justification remains an open problem. As already mentioned 
earlier, this problem is especially challenging in the classification context, 
since it is connected to the question of defining optimal classifiers without 
knowing in advance the noise condition of the underlying distribution. 

2.3. Akaike's heuristics revisited. In order to better understand the above 
rule of thumb and understand why it could be extended to non-Gaussian 
frameworks, it is instructive to come back to the original ideas of model 
selection via penalization, that is, Mallows's or Akaike's heuristics (see [4] 
and [1]). Both are based on the principle of unbiased estimation of the risk 
(at least asymptotically as far as Akaike's heuristics is concerned). Our idea 
is to adapt this principle to a nonasymptotic view of the question, for which 
one could hope to use concentration inequalities rather than limit theorems 
to validate the heuristics. 

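Let us consider, in each model $\mathcal{G}_m$, some minimizer $g_m$ of 
$g \mapsto \mathbb{E}[\ell \bullet g]$ over $\mathcal{G}_m$ (assuming that such 
a point does exist), and denote by $\hat g_m$ an empirical risk minimizer over 
$\mathcal{G}_m$. Defining for every $m \in \mathcal{M}$, 

$\hat b_m = P_n(\ell \bullet g_m - \ell \bullet g_*)$ and $\hat v_m = P_n(\ell \bullet g_m - \ell \bullet \hat g_m),$ 

minimizing some penalized criterion $P_n(\ell \bullet \hat g_m) + \mathrm{pen}(m)$ 
over $\mathcal{M}$ amounts to minimizing $\hat b_m - \hat v_m + \mathrm{pen}(m)$. 
The point is that $\hat b_m$ is an unbiased estimator of the bias term 
$L(g_m, g_*)$. With concentration arguments in mind, one can hope that 
minimizing the above quantity will be approximately equivalent to minimizing 
$L(g_m, g_*) - \mathbb{E}[\hat v_m] + \mathrm{pen}(m)$. Since the purpose of the 
game is to minimize the risk $\mathbb{E}[L(\hat g_m, g_*)]$, an ideal penalty 
would therefore be 

$\mathrm{pen}(m) = \mathbb{E}[\hat v_m] + \mathbb{E}[L(\hat g_m, g_m)].$ 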
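
As a sanity check on this ideal penalty, the following worked computation 
treats the fixed-design Gaussian linear case. The setting and notation 
($Y = s + \varepsilon$, $S_m$, $\Pi_m$, $\sigma^2$) are assumptions introduced 
here for illustration and are not taken from the text; the loss is the squared 
Euclidean distance normalized by $1/n$. 

```latex
% Assumed setting: Y = s + \varepsilon in R^n, \varepsilon ~ N(0, \sigma^2 I_n),
% S_m a linear subspace of dimension D_m with orthogonal projector \Pi_m.
\begin{align*}
s_m &= \Pi_m s, \qquad \hat s_m = \Pi_m Y = s_m + \Pi_m \varepsilon, \\
\hat v_m &= \tfrac{1}{n}\bigl(\|Y - s_m\|^2 - \|Y - \hat s_m\|^2\bigr)
          = \tfrac{1}{n}\|\hat s_m - s_m\|^2
          = \tfrac{1}{n}\|\Pi_m \varepsilon\|^2, \\
L(\hat s_m, s_m) &= \tfrac{1}{n}\bigl(\|s - \hat s_m\|^2 - \|s - s_m\|^2\bigr)
          = \tfrac{1}{n}\|\Pi_m \varepsilon\|^2, \\
\mathbb{E}[\hat v_m] &= \mathbb{E}[L(\hat s_m, s_m)] = \frac{\sigma^2 D_m}{n},
\qquad\text{so}\qquad
\mathrm{pen}(m) = \frac{2\sigma^2 D_m}{n}.
\end{align*}
```

Both identities follow from Pythagoras' theorem, since $Y - \hat s_m$ and 
$s - s_m$ are orthogonal to $S_m$ while $\hat s_m - s_m = \Pi_m \varepsilon$ 
lies in $S_m$. In this case the two terms of the ideal penalty coincide 
exactly, and their sum is the classical Mallows-type penalty 
$2\sigma^2 D_m / n$. 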

In Mallows's $C_p$ case, $\ell$ is the square loss, the models $\mathcal{G}_m$ 
are linear and $\mathbb{E}[\hat v_m] = \mathbb{E}[L(\hat g_m, g_m)]$ are 
explicitly computable (at least if the level of noise is assumed to be known). 
For Akaike's penalized log-likelihood criterion, this is similar, at least 
asymptotically. More precisely, in Akaike's heuristics, $\ell$ is the (minus) 
log-likelihood and one uses the fact that 
$\mathbb{E}[\hat v_m] \approx \mathbb{E}[L(\hat g_m, g_m)] \approx D_m/(2n)$, 
where $D_m$ stands for the number of parameters defining model $\mathcal{G}_m$. 

Of course, we do not want to rely on the second approximation, which is 
typically asymptotic and relies on the specific choice of the log-likelihood 
loss, as well as on regularity conditions of the parametric models that we 
certainly do not want to assume here. Our guess, however, is that one can 
trust the first approximation 
$\mathbb{E}[\hat v_m] \approx \mathbb{E}[L(\hat g_m, g_m)]$ in a more general 
situation. If one believes in the validity of this approximation, then a good 
penalty is $2\mathbb{E}[\hat v_m]$, or equivalently (still having concentration 
arguments in mind) $2\hat v_m$. This, in some sense, explains the rule of 
thumb which is given in the preceding section: the minimal penalty is 
$\hat v_m$, while the optimal penalty should be 
$\hat v_m + \mathbb{E}[L(\hat g_m, g_m)]$, and their ratio is approximately 
equal to 2 [note that Akaike's criterion itself may be interpreted by formula 
(9), the minimal penalty being taken as $D_m/(2n)$ and the optimal penalty as 
$D_m/n$]. As mentioned above, the interesting point is that, even though 
$\hat v_m$ is not observable, we can guess the minimal penalty from the data 
anyway. One way to do this in practice is to search for a minimal penalty of 
the form $\mathrm{pen}(m) = \alpha D_m$ and estimate $\alpha$ by choosing the 
smallest value for which the corresponding penalized criterion does not lead 
to selecting "very large" models. Of course, concentration arguments will 
work only if the list of models is not too rich. In practice, this means that, 
starting from a given list of models, one first has to decide to penalize in 
the same way the models which are defined by the same number of parameters. 
Then one considers a new list of models $(\mathcal{G}_D)_{D \ge 1}$, where for 
each integer $D$, $\mathcal{G}_D$ is the union of those among the initial 
models which are defined by $D$ parameters, and then applies the preceding 
heuristics to this new list. 
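
The calibration recipe described above can be sketched as follows. This is a 
minimal illustration and not a prescribed algorithm: the "largest jump" rule 
used to locate the minimal penalty is one possible formalization of "does not 
lead to selecting very large models", and the function names, the grid of 
candidate slopes and the input format (a dictionary mapping each dimension 
$D$ to the minimal empirical risk over $\mathcal{G}_D$) are all assumptions. 

```python
def selected_dimension(emp_risk, alpha):
    """Dimension D minimizing the penalized criterion emp_risk[D] + alpha * D."""
    return min(sorted(emp_risk), key=lambda d: emp_risk[d] + alpha * d)


def calibrate_penalty(emp_risk, alphas):
    """Estimate the minimal slope by the largest-jump rule, then double it.

    Returns (alpha_min, alpha_opt, selected dimension for alpha_opt).
    """
    alphas = sorted(alphas)
    chosen = [selected_dimension(emp_risk, a) for a in alphas]
    # Index of the largest drop in selected dimension between neighbors.
    k = max(range(len(alphas) - 1), key=lambda i: chosen[i] - chosen[i + 1])
    alpha_min = alphas[k + 1]    # first slope after the jump
    alpha_opt = 2.0 * alpha_min  # rule of thumb (9)
    return alpha_min, alpha_opt, selected_dimension(emp_risk, alpha_opt)
```

The estimated minimal slope is only as precise as the grid: the true 
threshold lies somewhere between the last slope that selects a very large 
model and the first one that does not. 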

3. Conclusion. Caricaturing a little, we could say that at this point, we 
have a beautiful, yet not very useful, theory — at least, this is the conclusion 
to which a person mainly interested in practical applications could come. 
Hold-out is our nemesis for theory, as is cross-validation for practice. 
Moreover, note that hold-out is also known to be quite unstable in practice 
(this is the reason why cross-validation is preferred), which widens the gap 
between theory and practice a little more. 

An optimistic way to look at this, though, is to say that we are only 
half-way climbing the slope, and that many interesting problems are open 
to future research efforts. Obviously, there are at least two directions of 
research. The first one consists in designing proper data-driven penalties 
which are ready to be used in practice and theoretically efficient (we have 
tried to give some ideas in this direction in the preceding section). In the 
spirit of the above results on hold-out, the second one consists in studying 
in depth the theoretical properties of V-fold cross-validation. 